Accurate goodness-of-fit tests for the extreme tails of empirical distributions are very important and relevant in many contexts, including geophysics, insurance and finance. We have derived exact asymptotic results for a generalization of the Kolmogorov-Smirnov test, well suited to test these extreme tails. In passing, we have rederived and made more precise the result of [P. L. Krapivsky and S. Redner, Am. J. Phys. 64(5):546, 1996] concerning the survival probability of a diffusive particle in an expanding cage.
I. INTRODUCTION AND MOTIVATION

The problem of testing whether a null-hypothesis theoretical probability distribution is compatible with the empirical probability distribution of a sample of observations is known as "Goodness-of-fit (GoF) testing" and is ubiquitous in all fields of science and engineering. The best known theoretical result is due to Kolmogorov and Smirnov (KS), and has led to the eponymous statistical test. Several specific cases have been studied (and/or are still under scrutiny), including: univariate or multivariate samples; independent or dependent data; different choices of distance measures [9]; investigation of different parts of the distribution domain; etc.

This class of problems has a particular appeal for physicists since the works of Doob [12] and Khmaladze [13], who have shown how GoF testing is related to stochastic processes. Finding the law of a test often amounts to treating a Fokker-Planck problem, which in turn maps into a Schrödinger equation for a particle in a certain potential confined by walls.

The classical KS test suffers from an important flaw: the test is only weakly sensitive to the quality of the fit in the tails of the tested distribution, when it is often these tail events (corresponding to centennial floods, devastating earthquakes, financial crashes, etc.) that one is most concerned with. Here we focus on a GoF test for a univariate sample, with the Kolmogorov distance but equi-weighted quantiles, which is equally sensitive to all regions of the distribution. We unify two earlier attempts at finding asymptotic solutions, one by Anderson and Darling in 1952 [10] and a more recent, seemingly unrelated one that deals with "life and death of a particle in an expanding cage" by Krapivsky and Redner [1, 13]. We present here the exact asymptotic solution of the corresponding stochastic problem, and deduce from it the precise formulation of the GoF test, which is of a fundamentally different nature than the KS test.


II. EMPIRICAL CUMULATIVE
DISTRIBUTION AND ITS FLUCTUATIONS

Let X be a latent random vector of N iid variables, with marginal cumulative distribution function (cdf) F. One realization of X consists of a time series $\{x_1, \ldots, x_n, \ldots, x_N\}$ that exhibits no persistence (the case where some nontrivial dependence is present is not considered here). For a given number x in the support of F, let Y(x) be the random vector the components of which are the Bernoulli variables $Y_n(x) = \mathbb{1}_{\{x_n \le x\}}$. The expected value and the covariance of $Y_n(x)$ are given by:

$$E[Y_n(x)] = F(x),$$
$$\mathrm{Cov}(Y_n(x), Y_m(x')) = \left[F(\min(x,x')) - F(x)F(x')\right]\delta_{nm}.$$

The centered sample mean of Y(x) is:

$$\bar{Y}(x) = \frac{1}{N}\sum_{n=1}^{N} Y_n(x) - F(x) \qquad (1)$$

which measures the difference between the empirically determined cdf at point x and its true value. It is therefore the quantity on which any statistic for Goodness-of-Fit testing is built. Denoting $u = F(x)$ and $v = F(x')$, the covariance function of $\bar{Y}$ is easily shown to be:

$$\mathrm{Cov}(\bar{Y}(u), \bar{Y}(v)) = \frac{1}{N}\left(\min(u,v) - uv\right)$$

(with a slight abuse of notation), where now and in the following

$$\bar{Y}(u) = \frac{1}{N}\sum_{n=1}^{N} Y_n(F^{-1}(u)) - u.$$



Limit properties

One now defines the process y(u) as the limit of $\sqrt{N}\,\bar{Y}(u)$ when $N \to \infty$. For a given u, it represents the difference between the empirically determined cdf of the (infinitely many) X's and the theoretical one, evaluated at the u-th quantile. According to the Central Limit Theorem, it is Gaussian and its covariance function is given by:

$$I(u, v) = \min(u, v) - uv \qquad (2)$$

which characterizes the so-called Brownian bridge, i.e. a Brownian motion y(u) such that y(u = 0) = y(u = 1) = 0.
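The covariance in Eq. (2) can be checked by simulation. The sketch below is only an illustration under stated assumptions: the data are taken uniform on [0, 1] (so that F(x) = x), and the function names, sample size and number of replications are hypothetical choices, not taken from the paper.

```python
import random


def scaled_ecdf_deviation(sample, u):
    """sqrt(N) * (empirical cdf at u minus u), for data assumed uniform on [0, 1]."""
    n = len(sample)
    return (sum(1 for x in sample if x <= u) / n - u) * n ** 0.5


def empirical_covariance(u, v, n=100, reps=2000, seed=0):
    """Monte Carlo estimate of Cov(sqrt(N) Ybar(u), sqrt(N) Ybar(v))."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        acc += scaled_ecdf_deviation(sample, u) * scaled_ecdf_deviation(sample, v)
    # Both deviations have zero mean, so the mean product estimates the covariance.
    return acc / reps


if __name__ == "__main__":
    u, v = 0.3, 0.7
    print("simulated:", empirical_covariance(u, v))
    print("min(u, v) - u*v:", min(u, v) - u * v)
```

The estimate fluctuates around min(u, v) - uv with a statistical error that shrinks as the number of replications grows.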



Norms over processes and the Kolmogorov-Smirnov test

In order to measure a limit distance between distributions, a norm $\|\cdot\|$ over the space of continuous bridges needs to be chosen. Interestingly, Eq. (2) does not depend explicitly on F, so that the law of $\|y\|$ under any norm is distribution free.

Typical such norms are the norm-2 (or 'Cramér-von Mises' distance)

$$\|y\|_2 = \int_0^1 y(u)^2\,du$$

as the bridge is always integrable, or the norm-sup

$$\|y\|_\infty = \sup_{u \in [0,1]} |y(u)|$$

as the bridge always reaches an extremal value (also called the Kolmogorov distance). Unfortunately, both these norms mechanically overweight the core values $u \approx 1/2$ and disfavor the tails $u \approx 0, 1$: since the variance of y(u) is zero at both extremes and maximal at the central value, the major contribution to $\|y\|$ indeed comes from the central region. In order to alleviate this effect (in particular when the GoF test is intended to investigate a specific region of the domain), it is preferable to introduce additional weights and study $\|y\sqrt{\psi}\|$ rather than $\|y\|$ itself. Anderson and Darling show in Ref. [10] that the solution to the problem with the Cramér-von Mises norm and arbitrary weights $\psi$ is obtained by spectral decomposition of the covariance kernel, and use of Mercer's theorem. In this note we will rather focus on the case $\psi(u) = 1/V[y(u)]$, which equi-weights all quantiles, and with the Kolmogorov distance, for which (to the best of our knowledge) no exact result has been reported in the literature.



III. THE WEIGHTED BROWNIAN BRIDGE: 
LAW OF THE SUPREMUM 



So again y(u) is a Brownian bridge, i.e. a centered Gaussian process on $u \in [0, 1]$ with covariance function

$$\mathrm{Cov}(y(u), y(u')) = \min(u, u') - uu'.$$

In particular, y(0) = y(1) = 0 with probability equal to 1, no matter how distant F is from the sample cdf around the core values. In order to zoom in on these tiny differences in the tails, we weight the Brownian bridge as follows: for given $a \in\, ]0,1[$ and $b \in [a, 1[$, we define

$$\tilde{y}(u) = y(u)\sqrt{\psi(u; a, b)} \qquad (3)$$

with

$$\psi(u; a, b) = \begin{cases} \dfrac{1}{u(1-u)}, & a \le u \le b \\ 0, & \text{otherwise.} \end{cases}$$

We will characterize the law of the supremum $K(a, b) = \sup_{u \in [0,1]} |\tilde{y}(u)|$:

$$P_<(k|a,b) = P[K(a,b) \le k] = P[|\tilde{y}(u)| \le k, \ \forall u \in [a,b]].$$
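For a finite sample, the statistic whose limit law is K(a, b) can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `weighted_ks_statistic` is hypothetical, and the input is assumed to be the transformed data u_n = F(x_n) under the null hypothesis. Between two jumps of the empirical cdf the weighted deviation is monotone in u, so it suffices to inspect the jump points inside [a, b] and the two endpoints.

```python
import math
from bisect import bisect_left, bisect_right


def weighted_ks_statistic(us, a, b):
    """sqrt(N) * sup over a <= u <= b of |F_N(u) - u| / sqrt(u (1 - u)).

    `us` holds the values F(x_n) of the sample under the tested cdf F,
    and 0 < a <= b < 1 is assumed.
    """
    us = sorted(us)
    n = len(us)
    candidates = [a, b] + [u for u in us if a <= u <= b]
    best = 0.0
    for u in candidates:
        weight = math.sqrt(u * (1.0 - u))
        at_u = bisect_right(us, u) / n        # empirical cdf at u
        best = max(best, abs(at_u - u) / weight)
        if u > a:
            below_u = bisect_left(us, u) / n  # left limit of the empirical cdf at u
            best = max(best, abs(below_u - u) / weight)
    return math.sqrt(n) * best


if __name__ == "__main__":
    print(weighted_ks_statistic([0.05, 0.2, 0.41, 0.58, 0.93], a=0.1, b=0.9))
```

With a single observation at u = 0.5 and [a, b] = [0.25, 0.75], the largest weighted deviation is reached at the jump and equals 1.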



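Diffusion in a cage with moving walls

Define the time change $t = \frac{u}{1-u}$. The variable $W(t) = (1 + t)\, y\!\left(\frac{t}{1+t}\right)$ is then a Brownian motion (Wiener process) on $\left[\frac{a}{1-a}, \frac{b}{1-b}\right]$, since one can check that:

$$\mathrm{Cov}(W(t), W(t')) = \min(t, t').$$

$P_<(k|a,b)$ can now be written as

$$P_<(k|a,b) = P\left[|W(t)| \le k\sqrt{t}, \ \forall t \in \left[\tfrac{a}{1-a}, \tfrac{b}{1-b}\right]\right].$$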
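The covariance identity behind this time change is a short algebraic check, which can also be done numerically. The sketch below (function names are illustrative only) plugs the bridge covariance min(u, u') - uu' into W(t) = (1 + t) y(t/(1 + t)) and compares the result with min(t, t').

```python
def bridge_cov(u, v):
    """Covariance of the Brownian bridge, min(u, v) - u v."""
    return min(u, v) - u * v


def wiener_cov_from_bridge(t, s):
    """Covariance of W(t) = (1 + t) y(t / (1 + t)) implied by the bridge covariance."""
    u, v = t / (1.0 + t), s / (1.0 + s)
    return (1.0 + t) * (1.0 + s) * bridge_cov(u, v)


if __name__ == "__main__":
    for t, s in [(0.2, 3.0), (1.5, 0.7), (2.0, 2.0)]:
        print(t, s, wiener_cov_from_bridge(t, s), min(t, s))
```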

Remarks: 

• The problem with initial time $\frac{a}{1-a} = 0$ and horizon time $\frac{b}{1-b} = T$ has been treated by Krapivsky and Redner in Ref. [1] as the survival probability $S(T; k = \sqrt{A/(2D)})$ of a Brownian particle diffusing with constant D in a cage with walls expanding as $\sqrt{At}$. Their result is that for large T,

$$S(T;k) = P_<\!\left(k \,\middle|\, 0, \tfrac{T}{1+T}\right) \propto T^{-\theta(k)}.$$

They obtain analytical expressions for $\theta(k)$ in both asymptotic limits $k \to 0$ and $k \to \infty$. We take here a slightly different route, suggested by Anderson and Darling in Ref. [10] but where the authors did not come to a conclusion. Our contributions are: (i) we treat the general case a > 0 for any k; (ii) we explicitly compute the k-dependence of both the exponent and the prefactor of the power-law decay; (iii) we provide the link with the theory of GoF tests and compute the pre-asymptotic distribution when $]a, b[ \,\to\, ]0, 1[$ of the weighted Kolmogorov-Smirnov test statistics.

• Choosing a constant weight function $\psi$ instead of the one above corresponds to the usual KS case and leads, after appropriate change of variable and time change, to a similar problem of a Brownian diffusion inside a box with walls moving at constant velocity. Since the walls now expand as $\propto t$, faster than the diffusive particle can move, the survival probability clearly decays to a positive value. The resulting survival probability turns out to be the usual Kolmogorov-Smirnov distribution.

• Other choices of $\psi$ apparently result in much harder problems, see Ref. [10].



An Ornstein-Uhlenbeck process with fixed walls

Introducing now the new time change $\tau = \ln\sqrt{\frac{1-a}{a}\,t}$, the variable $Z(\tau) = W(t)/\sqrt{t}$ is a stationary Ornstein-Uhlenbeck process on [0, T] where

$$T = \ln\sqrt{\frac{b(1-a)}{a(1-b)}}, \qquad \text{and} \qquad \mathrm{Cov}(Z(\tau), Z(\tau')) = e^{-|\tau - \tau'|}. \qquad (4)$$

Its dynamics is described by the Stochastic Differential Equation

$$dZ(\tau) = -Z(\tau)\,d\tau + \sqrt{2}\,dB(\tau) \qquad (5)$$

with $B(\tau)$ an independent Wiener process. The initial condition for $\tau = 0$ (corresponding to b = a) is $Z(0) = y(a)/\sqrt{V[y(a)]}$, a random Gaussian variable of zero mean and unit variance. The distribution $P_<(k|a, b)$ can now be understood as the unconditional survival probability of a mean-reverting particle in a cage with fixed absorbing walls:

$$P_<(k|T) = P[-k \le Z(\tau) \le k, \ \forall \tau \in [0, T]] = \int_{-k}^{k} f_T(z; k)\,dz$$

where

$$f_T(z; k)\,dz = P\left[Z(T) \in [z, z + dz[ \ \middle|\ \{Z(\tau)\}_{\tau < T}\right]$$

is the probability density of the particle being at z at time T, when walls are in $\pm k$. Its dependence on k, although not explicit on the right hand side, is due to the boundary condition associated with the absorbing walls (it will be dropped in the following for the sake of readability).
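The survival probability of the Ornstein-Uhlenbeck particle between fixed walls can be estimated by a crude Monte Carlo simulation. The sketch below is an illustration under stated assumptions, not the method used in the paper: it uses the exact Gaussian one-step transition of Eq. (5), but it only monitors the walls on a discrete time grid, so it slightly overestimates the survival probability. The function name and all default parameters are hypothetical.

```python
import math
import random


def ou_survival_probability(k, horizon, steps_per_unit=100, paths=1000, seed=0):
    """Fraction of simulated OU paths that stay within [-k, k] up to time `horizon`."""
    h = 1.0 / steps_per_unit
    decay = math.exp(-h)                       # E[Z(tau + h) | Z(tau)] = decay * Z(tau)
    noise = math.sqrt(1.0 - decay * decay)     # keeps the stationary variance equal to 1
    n_steps = int(round(horizon * steps_per_unit))
    survived = 0
    for i in range(paths):
        rng = random.Random(seed + i)          # one generator per path: same paths for every k
        z = rng.gauss(0.0, 1.0)                # stationary initial condition Z(0) ~ N(0, 1)
        alive = abs(z) <= k
        step = 0
        while alive and step < n_steps:
            z = z * decay + noise * rng.gauss(0.0, 1.0)
            alive = abs(z) <= k
            step += 1
        survived += alive
    return survived / paths


if __name__ == "__main__":
    for k in (1.0, 2.0, 3.0):
        print(k, ou_survival_probability(k, horizon=2.0))
```

Because every path uses its own seeded generator, estimates for different k are computed on identical trajectories and are therefore monotone in k.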

The Fokker-Planck equation governing the evolution of the density $f_\tau(z)$ reads

$$\partial_\tau f_\tau(z) = \partial_z\left[z f_\tau(z)\right] + \partial_z^2\left[f_\tau(z)\right], \qquad 0 < \tau < T.$$

Calling $\mathcal{H}_{FP}$ the second order differential operator $-\left[1 + z\partial_z + \partial_z^2\right]$, the full problem thus amounts to finding the general solution of

$$-\partial_\tau f_\tau(z) = \mathcal{H}_{FP}(z) f_\tau(z), \qquad f_\tau(\pm k) = 0, \ \forall \tau \in [0, T].$$

We have explicitly introduced a minus sign since we expect that the density decays with time in an absorption problem. Because of the term $z\partial_z$, $\mathcal{H}_{FP}$ is not Hermitian and thus cannot be diagonalized in an orthonormal basis. However, as is well known, one can define $f_\tau(z) = e^{-z^2/4}\varphi_\tau(z)$ and the Fokker-Planck equation becomes

$$-\partial_\tau \varphi_\tau(z) = \left[-\partial_z^2 + \tfrac{1}{4}z^2 - \tfrac{1}{2}\right]\varphi_\tau(z), \qquad \varphi_\tau(\pm k) = 0, \ \forall \tau \in [0, T]$$

and its Green function, i.e. the (separable) solution conditionally on the initial position $(z_i, T_i)$, is the superposition of all modes

$$G_\varphi(z, T \mid z_i, T_i) = \sum_\nu e^{-\theta_\nu (T - T_i)}\, \varphi_\nu(z)\, \varphi_\nu(z_i),$$

where $\varphi_\nu$ are the normalized solutions of the stationary Schrödinger equation:

$$\left[-\partial_z^2 + \tfrac{1}{4}z^2\right]\varphi_\nu(z) = \left(\theta_\nu + \tfrac{1}{2}\right)\varphi_\nu(z), \qquad \varphi_\nu(\pm k) = 0,$$

each decaying with its own energy $\theta_\nu$, where $\nu$ labels the different solutions with increasing eigenvalues, and the set of eigenfunctions $\{\varphi_\nu\}$ defines an orthonormal basis of the Hilbert space on which $\mathcal{H}_S(z) = \left[-\partial_z^2 + \tfrac{1}{4}z^2\right]$ acts. In particular,

$$\sum_\nu \varphi_\nu(z)\varphi_\nu(z') = \delta(z - z'), \qquad (6)$$

so that indeed $G_\varphi(z, T_i \mid z_i, T_i) = \delta(z - z_i)$, and the general solution writes

$$f_T(z_T; k) = \int_{-k}^{k} e^{\frac{z_i^2 - z_T^2}{4}}\, G_\varphi(z_T, T \mid z_i, T_i)\, f_0(z_i)\, dz_i$$

where $T_i = 0$, which corresponds to the case b = a in Eq. (3), and $f_0$ is the distribution of the initial value $z_i$ which is here, as noted above, Gaussian with unit variance.

$\mathcal{H}_S$ describes a harmonic oscillator of mass 1/2 and frequency $\omega = 1$ (in units where $\hbar = 1$) within an infinitely deep well of width 2k: its eigenfunctions are parabolic cylinder functions

$$y_+(\theta; z) = e^{-z^2/4}\, {}_1F_1\!\left(-\tfrac{\theta}{2}, \tfrac{1}{2}, \tfrac{z^2}{2}\right)$$
$$y_-(\theta; z) = z\, e^{-z^2/4}\, {}_1F_1\!\left(\tfrac{1-\theta}{2}, \tfrac{3}{2}, \tfrac{z^2}{2}\right)$$

properly normalized. The only acceptable solutions for a given problem are the linear combinations of $y_+$ and $y_-$ which satisfy orthonormality (6) and the boundary conditions: for periodic boundary conditions, only the integer values of $\theta$ would be allowed, whereas with our Dirichlet boundaries $\varphi_\nu(k) = \varphi_\nu(-k) = 0$, real non-integer eigenvalues $\theta$ are allowed [20]. For instance, the fundamental level $\nu = 0$ is expected to be the symmetric solution $\varphi_0(z) \propto y_+(\theta_0; z)$ with $\theta_0$ the smallest possible value compatible with the boundary condition. The boundary condition in fact provides the implicit equation for $\theta_0$:

$$\theta_0(k) = \inf_{\theta > 0}\{\theta : y_+(\theta; k) = 0\}. \qquad (7)$$

In what follows, it will be more convenient to make the k-dependence explicit, and a hat will denote the solution with the normalization relevant to our problem, namely $\hat{\varphi}_0(z; k) = y_+(\theta_0(k); z)/\|y_+\|_k$, with the norm

$$\|y_+\|_k^2 = \int_{-k}^{k} y_+(\theta_0(k); z)^2\, dz$$

so that $\int_{-k}^{k} \hat{\varphi}_0(z; k)^2\, dz = 1$.


Asymptotic survival rate

Denoting by $\Delta_\nu(k) = \theta_\nu(k) - \theta_0(k)$ the gap between the excited levels and the fundamental, the higher energy modes $\varphi_\nu$ cease to contribute to the Green function when $\Delta_\nu T \gg 1$, and their contributions to the above sum die out exponentially as T grows. Eventually, only the lowest energy mode $\theta_0(k)$ remains, and the solution tends to

$$f_T(z; k) = A(k)\, e^{-z^2/4}\, \hat{\varphi}_0(z; k)\, e^{-\theta_0(k) T}$$

when $T \gg (\Delta_1)^{-1}$, with

$$A(k) = \int_{-k}^{k} e^{z_i^2/4}\, \hat{\varphi}_0(z_i; k)\, f_0(z_i)\, dz_i. \qquad (8)$$

Let us come back to the initial problem of the weighted Brownian bridge reaching its extremal value in [a, b]. If we are interested in the limit case where a is arbitrarily close to 0 and b close to 1, then $T \to \infty$ and the solution is thus given by:

$$P_<(k|T) = A(k)\, e^{-\theta_0(k) T} \int_{-k}^{k} e^{-z^2/4}\, \hat{\varphi}_0(z; k)\, dz = \tilde{A}(k)\, e^{-\theta_0(k) T},$$

with $\tilde{A}(k) = \sqrt{2\pi}\, A(k)^2$.

We now compute explicitly the asymptotic behaviour of both $\theta_0(k)$ and $\tilde{A}(k)$:

$k \to \infty$. As k goes to infinity, the absorption rate $\theta_0(k)$ is expected to converge toward 0: intuitively, an infinitely far barrier will not absorb anything. At the same time, $P_<(k|T)$ must tend to 1 in that limit. So $\tilde{A}(k)$ necessarily tends to one. Indeed,

$$\tilde{A}(k) \xrightarrow{k \to \infty} \int_{-\infty}^{\infty} \hat{\varphi}_0(z; \infty)^2\, dz = 1. \qquad (9)$$

In principle, we see from Eq. (8) that corrections to the latter arise both (and jointly) from the functional relative difference of the solution $\epsilon(z; k) = y_+(\theta_0(k); z)/y_+(0; z) - 1$, and from the finite integration limits ($\pm k$ instead of $\pm\infty$). However, it turns out that the correction of the first kind is of second order in $\epsilon$. The correction to $\tilde{A}(k)$ is thus dominated by the finite integration limits ($\pm k$), so that pre-asymptotically:

$$\tilde{A}(k) \underset{k \to \infty}{\approx} \mathrm{erf}\!\left(\frac{k}{\sqrt{2}}\right). \qquad (10)$$

$k \to 0$. For small k, the system behaves like a free particle in a sharp and infinitely deep well, since the quadratic potential is almost flat around 0. The fundamental mode then becomes

$$\hat{\varphi}_0(z; k \to 0) = \frac{1}{\sqrt{k}} \cos\!\left(\frac{\pi z}{2k}\right) \qquad (11)$$

and consequently

$$\theta_0(k) \xrightarrow{k \to 0} \frac{\pi^2}{4k^2} - \frac{1}{2}, \qquad \tilde{A}(k) \xrightarrow{k \to 0} \sqrt{\frac{2}{\pi}}\, \frac{8}{\pi^2}\, k. \qquad (12)$$

We show in Fig. 1 the functions $\theta_0(k)$ and $\tilde{A}(k)$ computed numerically from the exact solution, together with their asymptotic analytic expressions. In intermediate values of k (roughly between 0.5 and 3) these asymptotic expressions fail to reproduce the exact solution.
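The implicit equation for the exponent, i.e. the smallest positive root in $\theta$ of $y_+(\theta; k) = 0$, can be solved numerically with a few lines of code. The sketch below is one possible approach and not the authors' implementation: Kummer's function is summed from its power series (adequate for the moderate arguments used here), the first sign change is bracketed by a scan with a hypothetical step of 0.5, and the root is refined by bisection. The large-k comparison $\sqrt{2/\pi}\, k\, e^{-k^2/2}$ is an assumption of this sketch; it is the form that reproduces the threshold values k* quoted in Sec. IV.

```python
import math


def hyp1f1(a, b, x, terms=200):
    """Power series of Kummer's confluent hypergeometric function 1F1(a; b; x)."""
    term = total = 1.0
    for n in range(terms):
        term *= (a + n) * x / ((b + n) * (n + 1))
        total += term
    return total


def y_plus(theta, z):
    """Even parabolic cylinder solution y_+(theta; z)."""
    return math.exp(-z * z / 4.0) * hyp1f1(-theta / 2.0, 0.5, z * z / 2.0)


def theta0(k, step=0.5, iters=60):
    """Smallest theta > 0 such that y_plus(theta, k) = 0."""
    lo, hi = 0.0, step
    while y_plus(hi, k) > 0.0:      # scan upward until the first sign change
        lo, hi = hi, hi + step
    for _ in range(iters):          # bisection inside the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if y_plus(mid, k) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def theta0_small_k(k):
    """Small-k limit: free particle in a box of width 2k."""
    return math.pi ** 2 / (4.0 * k * k) - 0.5


def theta0_large_k(k):
    """Assumed large-k form of the exponent (see the lead-in)."""
    return math.sqrt(2.0 / math.pi) * k * math.exp(-k * k / 2.0)


if __name__ == "__main__":
    for k in (0.2, 0.5, 1.0, 2.0, 3.5):
        print(k, theta0(k), theta0_small_k(k), theta0_large_k(k))
```

A useful exact check is k = 1: there $y_+(2; z) \propto (1 - z^2)\, e^{-z^2/4}$ vanishes at the walls, so the exponent equals 2.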



Higher modes and validity of the asymptotic ($N \gg 1$) solution.

Higher modes $\nu > 0$ with energy gaps $\Delta_\nu \lesssim 1/T$ must in principle be kept in the pre-asymptotic computation. This however is irrelevant in practice since the gap $\theta_1 - \theta_0$ is never small. Indeed, $\hat{\varphi}_1(z; k)$ is proportional to the asymmetric solution $y_-(\theta_1(k); z)$ and its energy

$$\theta_1(k) = \inf_{\theta > \theta_0(k)}\{\theta : y_-(\theta; k) = 0\}$$

is found numerically to be very close to $1 + 4\theta_0(k)$. In particular, $\Delta_1 > 1$ (as we illustrate in Fig. 2) and thus $T\Delta_1 \gg 1$ will always be satisfied in cases of interest.

FIG. 1: Top: Dependence of the exponent $\theta_0$ on k; similar to Fig. 2 in Ref. [1] (see in particular Eqs. (9b) and (12) there). Bottom: Dependence of the prefactor $\tilde{A}$ on k. The red plain lines illustrate the analytical behavior in the limiting cases $k \to 0$ and $k \to \infty$.
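The statement that the gap between the first excited level and the fundamental exceeds 1 can be probed numerically. The sketch below is a hedged illustration rather than the authors' computation: it finds the smallest positive roots in $\theta$ of the even and odd parabolic cylinder solutions at the wall, using a series for Kummer's function, a scan with a hypothetical step of 0.5, and bisection.

```python
import math


def hyp1f1(a, b, x, terms=200):
    """Power series of Kummer's confluent hypergeometric function 1F1(a; b; x)."""
    term = total = 1.0
    for n in range(terms):
        term *= (a + n) * x / ((b + n) * (n + 1))
        total += term
    return total


def y_plus(theta, z):
    """Even solution, used for the fundamental level."""
    return math.exp(-z * z / 4.0) * hyp1f1(-theta / 2.0, 0.5, z * z / 2.0)


def y_minus(theta, z):
    """Odd solution, used for the first excited level."""
    return z * math.exp(-z * z / 4.0) * hyp1f1((1.0 - theta) / 2.0, 1.5, z * z / 2.0)


def smallest_root(f, step=0.5, iters=60):
    """Smallest theta > 0 with f(theta) = 0, assuming f(0) > 0."""
    lo, hi = 0.0, step
    while f(hi) > 0.0:
        lo, hi = hi, hi + step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gap(k):
    """Delta_1(k) = theta_1(k) - theta_0(k) for walls at +k and -k."""
    theta_0 = smallest_root(lambda th: y_plus(th, k))
    theta_1 = smallest_root(lambda th: y_minus(th, k))
    return theta_1 - theta_0


if __name__ == "__main__":
    for k in (0.5, 1.0, 2.0, 3.0):
        print(k, gap(k))
```

An exact reference point is $k = \sqrt{3}$, where $y_-(3; z) \propto z(1 - z^2/3)\, e^{-z^2/4}$ vanishes at the walls, so the first excited level equals 3.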



IV. BACK TO GOF TESTING AND 
CONCLUSION 



Let us now come back to GoF testing. In the case of a constant weight, corresponding to the classical KS test, the probability $P_<(k|a = 0, b = 1)$ is well defined and has the well known KS form:

$$P_<(k|a = 0, b = 1) = 1 - 2\sum_{n=1}^{\infty} (-1)^{n-1} e^{-2n^2k^2},$$

which, as expected, grows from 0 to 1 as k increases. The value k* such that this probability is 95% is $k^* \approx 1.358$. This can be interpreted as follows: if, for a data set of size N, the maximum value of $|\bar{Y}(u)|$ is larger than $\approx 1.358/\sqrt{N}$, then the hypothesis that the proposed distribution is a "good fit" can be rejected with 95% confidence.

FIG. 2: $1/\Delta_1(k)$ saturates to 1, so that the condition $N \gg \exp(1/\Delta_1(k))$ is virtually always satisfied.
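The classical threshold can be recovered from the KS series in a few lines. This is a minimal sketch with hypothetical function names: the series is truncated at a fixed number of terms, and the 95% quantile is found by bisection on an assumed bracket [0.2, 5].

```python
import math


def ks_cdf(k, terms=100):
    """Kolmogorov-Smirnov limit law 1 - 2 sum_{n>=1} (-1)^(n-1) exp(-2 n^2 k^2)."""
    if k <= 0.0:
        return 0.0
    series = sum((-1) ** (n - 1) * math.exp(-2.0 * n * n * k * k)
                 for n in range(1, terms + 1))
    return 1.0 - 2.0 * series


def ks_quantile(p, lo=0.2, hi=5.0, iters=80):
    """Value k such that ks_cdf(k) = p, by bisection (ks_cdf is increasing in k)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ks_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    k_star = ks_quantile(0.95)
    print("k* =", k_star)
    print("rejection threshold for N = 1000:", k_star / math.sqrt(1000))
```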

In order to convert the above calculations into a meaningful test, one must specify values of a and b. The natural choice is a = 1/N and b = 1 − a, corresponding to the quantiles of the min and the max of the sample series. Indeed, $a = F(\min_n x_n) \approx \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}_{\{x_n \le \min_m x_m\}} = \frac{1}{N}$, and similarly for b. Correspondingly, the relevant value of T is given, according to Eq. (4) above, by

$$T = \ln\sqrt{\frac{b(1-a)}{a(1-b)}} \approx \ln N, \qquad N \gg 1.$$

This leads to our central result for the cdf of the weighted maximal Kolmogorov distance $K\!\left(\tfrac{1}{N}, 1 - \tfrac{1}{N}\right)$ under the hypothesis that the tested and the true distributions coincide:

$$S(N; k) = P_<(k|\ln N) = \tilde{A}(k)\, N^{-\theta_0(k)} \qquad (13)$$

which is valid whenever $N \gg 1$ since, as we discussed above, the energy gap $\Delta_1$ is greater than unity.

The final cumulative distribution function (the test law) is depicted in Fig. 3 for different values of the sample size N. Contrary to the standard KS case, this distribution still depends on N. In particular, the threshold value k* corresponding to a 95% confidence level increases with N. Since for large N, $k^* \gg 1$, one can use the asymptotic expansion above, which soon becomes quite accurate, as shown in Fig. 3. This leads to:

$$\theta_0(k^*) \approx -\frac{\ln 0.95}{\ln N} \approx \sqrt{\frac{2}{\pi}}\, k^*\, e^{-k^{*2}/2}$$

which gives $k^* \approx 3.439, 3.529, 3.597, 3.651$ for, respectively, $N = 10^3, 10^4, 10^5, 10^6$. For exponentially large N and to logarithmic accuracy one has: $k^* \sim \sqrt{2 \ln \ln N}$. This variation is very slow, but one sees that as a matter of principle, the "acceptable" maximal value of the weighted distance is much larger (for large N) than in the KS case.
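The quoted thresholds can be reproduced by solving the large-k relation for k*. The sketch below rests on an assumption that should be read as such: the exponent is taken in the form $\theta_0(k) \approx \sqrt{2/\pi}\, k\, e^{-k^2/2}$ and the prefactor is set to 1, which is the combination that matches the four values of k* given in the text. The function names and the bisection bracket [1, 10] are illustrative.

```python
import math


def theta0_large_k(k):
    """Assumed large-k form of the exponent theta_0(k)."""
    return math.sqrt(2.0 / math.pi) * k * math.exp(-k * k / 2.0)


def k_star(n_samples, level=0.95, lo=1.0, hi=10.0, iters=80):
    """Solve theta0_large_k(k) = -ln(level) / ln(N) on the branch k > 1."""
    target = -math.log(level) / math.log(n_samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # theta0_large_k decreases for k > 1, so a value above target means k is too small.
        if theta0_large_k(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for n in (1e3, 1e4, 1e5, 1e6):
        print(int(n), round(k_star(n), 3), round(math.sqrt(2.0 * math.log(math.log(n))), 3))
```

The last column prints the leading-order estimate $\sqrt{2 \ln \ln N}$, which is only accurate to logarithmic precision and is noticeably smaller than k* at these sample sizes.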





In conclusion, we believe that accurate GoF tests for the extreme tails of empirical distributions are very important and relevant in many contexts. We have derived exact asymptotic results for a generalization of the Kolmogorov-Smirnov test, well suited to test these extreme tails. Our final results are summarized in Fig. 3. In passing, we have rederived and made more precise the result of Krapivsky and Redner [1] concerning the survival probability of a diffusive particle in an expanding cage. It would be interesting to exhibit other choices of weight functions that lead to soluble survival probabilities. It would also be interesting to extend the present results to multivariate distributions, and to dependent
observations, along the lines of Ref. 



FIG. 3: Dependence of S(N; k) on k for $N = 10^3, 10^4, 10^5, 10^6$ (from left to right). As N grows toward infinity, the curve is shifted to the right, and eventually $S(\infty; k)$ is zero for any k. The red plain lines illustrate the analytical behavior in the limiting cases $k \to 0$ and $k \to \infty$. The horizontal grey line corresponds to a 95% confidence level.